Multi-Class Classification
Core Concept
Multi-class classification extends binary classification to handle three or more mutually exclusive categories, where each instance belongs to exactly one class. The model must learn decision boundaries that partition the feature space into multiple regions, one for each class. Examples include digit recognition (0–9), document categorisation by topic, species identification, and medical diagnosis across multiple disease types. This is a more complex setting than binary classification: instead of a single boundary separating two outcomes, the model must distinguish among many alternatives while preserving the constraint that predictions are mutually exclusive.
Key Characteristics
- Multiple decision boundaries – The feature space is partitioned into C regions for C classes. Depending on the algorithm, this may be achieved by learning one boundary per class (e.g. One-vs-Rest), pairwise boundaries (One-vs-One), or a single joint partitioning (e.g. decision trees, neural networks with softmax).
- Decomposition strategies – Many binary algorithms are extended to multi-class via One-vs-Rest (OvR), which trains one classifier per class (that class vs all others) and selects the class with the highest confidence, or One-vs-One (OvO), which trains a binary classifier for every pair of classes, C(C−1)/2 in total, and uses voting for the final prediction. Some algorithms handle multiple classes natively without decomposition (a sketch of both strategies follows this list).
- Softmax and cross-entropy – Neural networks for multi-class classification typically use a softmax output layer to convert logits into a probability distribution over all classes that sums to 1. Training usually employs cross-entropy loss, the negative log-probability assigned to the true class. The predicted class is the one with the highest probability; the full distribution provides uncertainty information (see the NumPy sketch after this list).
- Evaluation metrics – Accuracy remains the overall proportion of correct predictions. Confusion matrices become C×C, showing actual vs predicted class and revealing which classes are commonly confused. Macro-averaging computes a metric per class and takes an unweighted mean, treating all classes equally; micro-averaging pools the underlying counts (true positives, false positives, false negatives) across classes, so frequent classes dominate; weighted averaging averages per-class scores weighted by each class's support, which reflects imbalance when that is desired (see the metrics sketch after this list).
- Class imbalance and hierarchy – Imbalance is more complex with multiple classes: some may be well represented while others are rare (a class-weighting sketch follows this list). Hierarchical classification can help when classes have natural groupings (e.g. first mammal/bird/reptile, then species). Error costs may also differ between class pairs (e.g. misclassifying malignant as benign is usually costlier than the reverse).
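As a concrete illustration of the two decomposition strategies, here is a minimal sketch using scikit-learn's OneVsRestClassifier and OneVsOneClassifier on the bundled digits dataset; the choice of logistic regression as the base estimator is arbitrary:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# 10 classes: the digits 0-9
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000)  # any binary-capable estimator works

# One-vs-Rest: one classifier per class (that class vs all others)
ovr = OneVsRestClassifier(base).fit(X_train, y_train)

# One-vs-One: one classifier per class pair, C(C-1)/2 = 45 for C = 10
ovo = OneVsOneClassifier(base).fit(X_train, y_train)

print("OvR models:", len(ovr.estimators_), "accuracy:", ovr.score(X_test, y_test))
print("OvO models:", len(ovo.estimators_), "accuracy:", ovo.score(X_test, y_test))
```

OvO trains many more models, but each sees only two classes' worth of data, which can be cheaper for algorithms that scale poorly with training-set size (e.g. kernel SVMs).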
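A minimal NumPy sketch of softmax and cross-entropy; the logits and the true class are made-up values for illustration:

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit for numerical stability; output sums to 1
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def cross_entropy(probs, true_class):
    # Negative log-probability the model assigned to the correct class
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])         # raw scores for 3 classes (made up)
probs = softmax(logits)
print(probs)                               # ~[0.659, 0.242, 0.099], sums to 1
print(int(np.argmax(probs)))               # predicted class: 0
print(cross_entropy(probs, true_class=0))  # loss ~0.417
```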
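A short sketch of the C×C confusion matrix and the three averaging modes, using scikit-learn's metrics on made-up labels:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]  # made-up labels; class 2 is most frequent
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

# 3x3 matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="micro"))     # pooled counts; equals accuracy here
print(f1_score(y_true, y_pred, average="weighted"))  # per-class scores weighted by support
```

In a single-label multi-class setting, micro-averaged F1 reduces to plain accuracy, so macro or weighted averages are usually what reveal poor performance on rare classes.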
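One common mitigation for multi-class imbalance is to reweight classes during training. A sketch using scikit-learn's class_weight option on a synthetic imbalanced dataset (the dataset and all parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem with one rare class (5% of samples)
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# so training errors on the rare class are penalised more heavily
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print(clf.predict(X[:5]))
```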
Common Applications
- Digit and character recognition – Assigning images or signals to one of 10 digits or a set of character classes
- Document categorisation – Assigning documents to one topic or category from a fixed set (e.g. news, sports, science)
- Species identification – Classifying specimens or images into one of many species or taxa
- Medical diagnosis (multiple conditions) – Determining which of several disease types or conditions is present from patient data
- Intent classification – Mapping user utterances or queries to one of several predefined intents
- Product categorisation – Placing items into a single category in a taxonomy
- Gesture or activity recognition – Classifying signals or video into one of several gestures or activities